This dataset explores the characteristics of about 5000 white wines. Each wine is graded on a scale from 0 (very bad) to 10 (excellent). In addition to this grading, the dataset also gives several attributes of those wines, such as acidity, residual sugar, pH…
The goal of this analysis is to understand better what makes a good white wine, and how its ranking can be explained by such characteristics.
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
Our dataset consists of 4898 white wines evaluated on 12 different characteristics, 11 being the chemical characteristics of the wines (such as acidity, density, pH, etc) and the last one being the overall quality of the wine as assessed by experts.
Let’s first assess the distribution of the quality of our wines:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
## x freq
## 1 3 20
## 2 4 163
## 3 5 1457
## 4 6 2198
## 5 7 880
## 6 8 175
## 7 9 5
The median value for wine quality is 6/10, with 2198 wines with that rating. Only 5 wines in our sample are ranked as 9/10 in quality, and none of them made it to 10/10. Similarly, no wine is ranked lower than 3 out of 10.
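As a minimal sketch, the summary statistics above can be rebuilt from the frequency table alone. The variable names `quality_counts` and `quality` are illustrative and not taken from the original code:

```r
# Frequency table of quality scores, copied from the output above.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

# Expand the counts back into one score per wine.
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

length(quality)   # number of wines
summary(quality)  # minimum, quartiles, median, mean, maximum
table(quality)    # number of wines per score
```

Working from the counts rather than the raw file keeps the example self-contained; with the full data frame loaded, the same calls would apply to the quality column directly.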
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
Most wines seem to have an acidity in between 6 and 8 g / dm^3, with outliers up to 14.2 g / dm^3. Let’s zoom in by limiting the axes in our graph:
The distribution is skewed right, with a peak in between 6.5 and 7 g / dm^3.
Let’s visualize those outliers along with the general distribution of the data in a boxplot:
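To make "outlier" concrete, here is a small sketch of the 1.5 × IQR rule that boxplots conventionally use to decide which points are drawn individually. It takes the quartiles from the summary above; the names `lower_fence` and `upper_fence` are illustrative, and a plotting function may compute its hinges slightly differently from `summary()`:

```r
# Quartiles of fixed acidity, taken from the summary output above (g/dm^3).
q1 <- 6.3
q3 <- 7.3
iqr <- q3 - q1

# Tukey's rule: points further than 1.5 * IQR from the box count as outliers.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

c(lower = lower_fence, upper = upper_fence)

# Compare the observed minimum (3.8) and maximum (14.2) with the fences.
c(min_is_outlier = 3.8 < lower_fence, max_is_outlier = 14.2 > upper_fence)
```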
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
Most volatile acidity levels lie between 0.2 and 0.4 g / dm^3, with some outliers above 0.9 g / dm^3. Overall the distribution is still fairly symmetric, with the median and the mean of that variable being quite close (0.26 for the median and 0.2782 for the mean). If we zoom in:
We do indeed see a peak in our distribution at around 0.25 g / dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
The distribution here is also bell-shaped, with the median and the mean close to one another (0.32 g / dm^3 for the median vs 0.3342 g / dm^3 for the mean). The values range from 0 to 1.66 g / dm^3. In the graph we can see a peak at about 0.5 g / dm^3. Let’s zoom in:
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
Interestingly, over 200 wines show a citric acid amount of exactly 0.49 g / dm^3, while the other amounts close to it (0.46 to 0.52 for instance) have much lower wine counts. The table above shows the metrics for this subset of wines with citric acid of 0.49 g / dm^3, and this particularity does not seem to have an impact on their quality ratings (ranging from 3 to 9 out of 10).
From the notes on the dataset, we know that “found in small quantities, citric acid can add ‘freshness’ and flavor to wines”. That seems to indicate a different way of looking at it than the two previous acidity measures; in the bivariate analysis section, we could try here to estimate what is a good “small quantity” by comparing citric acid concentrations to wine quality.
After studying acidity, let’s now look into sugar amounts in our wine:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
Interestingly, a very large number of our wines seem to have rather low amounts of residual sugar. The mean of our dataset is 6.391 g / dm^3 while the maximum observed value is 65.8 g / dm^3. Let’s zoom in on our dataset with residual sugar concentrations between 0 and 20 g / dm^3:
We see here that the distribution is long-tailed. Let’s see what we get using the log transform:
The distribution now looks bimodal.
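The effect of the log transform on a long tail can be sketched using only the five summary points of residual sugar reported above. The ratio used here (distance from median to maximum, divided by the interquartile range) is just an illustrative way to quantify the tail, not a measure taken from the original analysis:

```r
# Five summary points of residual sugar from the output above (g/dm^3).
sugar <- c(min = 0.6, q1 = 1.7, median = 5.2, q3 = 9.9, max = 65.8)

# On the raw scale the maximum sits many IQRs away from the median.
raw_ratio <- unname((sugar["max"] - sugar["median"]) / (sugar["q3"] - sugar["q1"]))

# After log10 the same points are spread far more evenly.
log_sugar <- log10(sugar)
log_ratio <- unname((log_sugar["max"] - log_sugar["median"]) /
                    (log_sugar["q3"] - log_sugar["q1"]))

round(log_sugar, 2)
c(raw = raw_ratio, log10 = log_ratio)
```

In a ggplot2 histogram the same transform is commonly applied to the axis with `scale_x_log10()` rather than to the data itself.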
My primary assumption would be that regarding white wines, a higher amount of residual sugar could mean better quality rankings - even though we do not know what type of white wines those Vinho Verde are.
Chlorides refer to the amount of salt in the wine:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
The distribution here is very long tailed. Let’s apply a log transform:
The log transform makes the variance decrease significantly and the chlorides distribution now appears normal.
Free Sulfur Dioxide “prevents microbial growth and the oxidation of wine”.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
The distribution and the boxplot show an outlier at 289 mg / dm^3. Let’s remove it by limiting the axes:
The distribution now appears very close to normal, although slightly skewed to the right.
Total sulfur dioxide is the sum of the amounts of free and bound forms of sulfur dioxide.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
We see a symmetric, bell-shaped distribution for our total sample. There is also an outlier at 440 mg / dm^3, but this is probably caused by the outlier in the free sulfur dioxide distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
Density also presents an outlier at 1.04. Let’s remove it and zoom in:
Most of the wines seem to have a density in between 0.99 and 1.00, meaning close to the density of water.
According to the notes, pH ranges from a scale of “0 (very acidic) to 14 (very basic)”.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
In our dataset, the wines have a pH between 2.72 and 3.82. The distribution is symmetric and bell-shaped.
According to the notes attached to our initial dataset, sulphates act as a “wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant”.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
The distribution is bell-shaped, with a somewhat long tail. The range of values is quite wide, going from 0.22 to 1.08.
Using a log transform makes the distribution look closer to normal by reducing most of the right skew.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
The boxplot shows an interesting distribution here, with many data points that seem to lie outside the second and third quartiles.
Applying a log10 to the variable does not make the distribution very normal. We’ll investigate this variable further in the next sections.
Our dataset consists of 4898 white wines evaluated on 12 different characteristics. 11 of those characteristics are quantitative chemical observations for the wines (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol), while the last observation refers to their quality rating as assessed by experts (on a scale from 0 to 10).
Given that this analysis is aiming to understand which variables are the biggest drivers in wine quality, quality is definitely the main feature of interest.
While all 11 variables may have a role to play in the final quality of the wine, the univariate plots section helped me identify the key features I’ll focus on: acidity, residual sugar, chlorides and sulfur dioxide. I may also include alcohol if I see anything of interest.
I’m also interested in how those variables behave relative to one another: for instance, I would assume that the wines with the highest proportion of residual sugar have the lowest acidity; similarly, residual sugar may also affect their chlorides amounts (chlorides being salt).
I did not create any variable. In the rest of the analysis, I may create additional variables to combine the acidity metrics (fixed and volatile) into one.
Most of the distributions except one presented a bell-shaped, normal-looking distribution, even though some of them were slightly skewed to the left or right of the x axis. I applied a log transform to some variables - residual sugar, chlorides, sulphates, and alcohol. The only remaining non-normal distributions after the log transform were the one of the residual sugar variable, whose histogram looked bi-modal, and the alcohol one.
Other adjustments that I regularly made were changing the bin width of the histograms, and “zooming in” on the distribution by limiting the x-axis to remove outliers.
Let’s first remove the variable “X” from our dataset as it does not add any value to the analysis: the wines dataset now has 12 variables instead of 13.
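Dropping the index column can be sketched as follows. The data frame here is a tiny stand-in built from the first five rows of a few columns shown in the `str()` output, not the full dataset, so it has 5 columns rather than 13:

```r
# First five rows of a few columns, copied from the str() output above.
wines <- data.frame(
  X              = 1:5,
  fixed.acidity  = c(7, 6.3, 8.1, 7.2, 7.2),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9),
  quality        = c(6L, 6L, 6L, 6L, 6L)
)

ncol(wines)      # column count including the row index X
wines$X <- NULL  # assigning NULL removes the column in place
names(wines)     # remaining column names
```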
Now we can use ggcorr to look at the variables’ correlation coefficients:
(A PNG image of this output is included in the submission)
Interestingly, the strongest correlations in regard to quality are chlorides (-0.21), density (-0.307) and alcohol (0.436). While quality is negatively correlated with chlorides and density, there is a strong positive correlation between quality and alcohol. I’ll investigate this further in the next section, “Biggest drivers of Quality”.
Alcohol itself is strongly negatively correlated with residual sugar (-0.451), total sulfur dioxide (-0.449) and density (-0.78). Given that alcohol seems from this chart the biggest driver in quality, I’ll also investigate the relationship between alcohol and those three variables in a second section, “Biggest drivers of Alcohol”.
Regarding the correlations between the other variables in the dataset: contrary to expectations, residual sugar has low correlation levels with fixed acidity (0.089) or chlorides (0.0887). However, it is most strongly linked to density (0.839) and total sulfur dioxide (0.401). The volatile and fixed acidity variables, however, do not have any strong correlations with any other variables in the dataset, not even with each other.
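The coefficients quoted above are Pearson correlations, which base R computes with `cor()`. The sketch below runs it on the first ten rows printed in the `str()` output, so the values it returns will differ from the full-dataset figures; `sample_wines` is an illustrative name:

```r
# First ten rows of four columns, copied from the str() output above.
sample_wines <- data.frame(
  residual.sugar       = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  chlorides            = c(0.045, 0.049, 0.05, 0.058, 0.058,
                           0.05, 0.045, 0.045, 0.049, 0.044),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129),
  alcohol              = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
)

# Pearson correlation matrix for every pair of columns.
cor_matrix <- cor(sample_wines)
round(cor_matrix, 2)

# A single pair can be queried directly.
cor(sample_wines$alcohol, sample_wines$residual.sugar)
```

Quality is left out of this sample because all ten printed rows share the same score, which would give an undefined correlation.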
Let’s plot together quality and chlorides on different boxplots according to quality ratings:
To gain further visibility, I’ll create a new column, “rating”, which will take the value “low” if quality <= 4, “high” if quality >= 8, and average for everything in between. This is very similar to our two subsets of the previous section, but will allow us to have everything in one dataframe.
##
## average high low
## 4535 180 183
I have the same numbers of high and low quality wines as before.
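One way to derive such a column is a nested `ifelse()`. This is a sketch of the rule described above, not necessarily the code used for this report. It rebuilds the quality scores from the frequency table shown earlier so that the counts can be checked:

```r
# Quality scores rebuilt from the frequency table shown earlier.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

# "low" for scores up to 4, "high" for scores of 8 and above,
# "average" for everything in between.
rating <- ifelse(quality <= 4, "low",
                 ifelse(quality >= 8, "high", "average"))

table(rating)
```

With the real data frame, the same expression would be assigned to a new column, for instance `wines$rating`.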
Next, I’ll again create a scatterplot and three boxplots according to those wines’ quality ratings:
## wines$rating: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04589 0.05000 0.34600
## --------------------------------------------------------
## wines$rating: high
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01400 0.03000 0.03550 0.03801 0.04400 0.12100
## --------------------------------------------------------
## wines$rating: low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01300 0.03750 0.04600 0.05056 0.05400 0.29000
From the boxplots, we can see that while there is no visible difference between the average and the low quality boxplots, the high quality one presents a lower chlorides level.
## wines$rating: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9918 0.9938 0.9941 0.9962 1.0390
## --------------------------------------------------------
## wines$rating: high
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9903 0.9916 0.9922 0.9935 1.0010
## --------------------------------------------------------
## wines$rating: low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9892 0.9926 0.9941 0.9943 0.9960 1.0000
Again here, the high quality wines tend to have lower densities than the average or low quality wines. This is interesting because we also saw that density is strongly correlated with residual sugar amounts (0.839), so lower levels of sugar can go a long way towards explaining low density levels. I’d say that the correlation between density and quality is not as much a proof of causality as the ones between sugar and quality or alcohol and quality may be. Low density seems to me a consequence rather than a cause of high quality in wines.
## wines$rating: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.40 10.30 10.48 11.30 14.20
## --------------------------------------------------------
## wines$rating: high
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 11.00 12.00 11.65 12.60 14.00
## --------------------------------------------------------
## wines$rating: low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.40 10.10 10.17 10.80 13.50
Here the difference between high quality wines and the other ones is the strongest visually: high quality wines clearly have higher alcohol levels than average ones, which also have higher levels than the low quality wines. Because this is the strongest correlation found here in regard to quality, let’s analyze further what seems to be driving high levels of alcohol in wines.
## [1] -0.4506312
The correlation between alcohol and residual sugar is -0.4506, illustrated by the regression line in the scatter plot above. I removed the only sweet wine in our dataset from this visualization as it is an outlier.
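A regression line like the one mentioned above comes from a simple linear fit. The sketch below uses `lm()` on the ten rows printed in the `str()` output only, so the slope it returns is illustrative and does not reproduce the full-dataset fit:

```r
# First ten rows of residual sugar and alcohol from the str() output above.
sugar   <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)
alcohol <- c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)

# Least-squares fit of alcohol as a linear function of residual sugar.
fit <- lm(alcohol ~ sugar)
slope <- unname(coef(fit)["sugar"])

coef(fit)            # intercept and slope of the fitted line
cor(alcohol, sugar)  # the slope has the same sign as the correlation
```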
## [1] -0.4488921
Alcohol and total sulfur dioxide are also negatively correlated at -0.4488. In the graphs above, I limited the y-axis to 350 mg / dm^3, excluding one data point.
## [1] -0.7801376
The graph above excludes the data point at 1.04 density.
The strongest correlation with alcohol is again density, at -0.7801. Again I am going to go with the assumption that this illustrates density as a consequence of higher alcohol and higher sugar levels and not a cause of high quality in wines.
Our main variable of interest was quality. We can see that quality has the strongest correlations with chlorides, density, and alcohol, alcohol being the strongest link of all. It is also the only positively correlated variable, chlorides and density being negatively correlated with quality.
Because the link between quality and alcohol was so strong, I also analyzed the biggest drivers in alcohol levels, mainly residual sugar, total sulfur dioxide, and density. Again the strongest link was with alcohol and density, but I’d interpret this as a correlation and not a causation.
The strongest relationship found overall was between density and residual sugar, with a correlation of 0.839.
While our initial dataset did not have any categorical variable, we created one - rating - in the previous sections. Another way to gain further insights into our dataset would be to convert some of our features into categorical variables, instead of numerical ones.
I’ll focus on the 3 biggest drivers of quality: chlorides, density, and alcohol, and the 2nd strongest correlation with alcohol apart from density: residual sugar.
I’ll split all those variables following their quartile values, except for residual sugar, which was shown to have a bimodal distribution. I’ll split this one at its median.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
## Length Class Mode
## 0 NULL NULL
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
We also removed the rows with NA values, to reach 4893 observations.
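The bucketing described above can be sketched with `cut()` for a quartile split and `ifelse()` for a median split. The break points come from the summaries above; the ten sample values come from the `str()` output, and the bucket labels are illustrative:

```r
# First ten values of chlorides and residual sugar from the str() output.
chlorides <- c(0.045, 0.049, 0.05, 0.058, 0.058,
               0.05, 0.045, 0.045, 0.049, 0.044)
sugar     <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

# Quartile boundaries of chlorides: min, Q1, median, Q3, max.
chlorides_breaks <- c(0.009, 0.036, 0.043, 0.05, 0.346)

# include.lowest = TRUE keeps the minimum value inside the first bucket.
chlorides_bucket <- cut(chlorides, breaks = chlorides_breaks,
                        labels = c("Q1", "Q2", "Q3", "Q4"),
                        include.lowest = TRUE)
table(chlorides_bucket)

# Residual sugar is split in two at its median (5.2 g/dm^3).
sugar_bucket <- ifelse(sugar <= 5.2, "low sugar", "high sugar")
table(sugar_bucket)
```

Values outside the supplied breaks would come back from `cut()` as `NA`, which is one way rows with missing buckets can appear.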
We can clearly see how alcohol levels impact density, while chlorides levels do not have that same impact. The correlation between alcohol and chlorides is getting stronger and stronger as density decreases.
This chart illustrates further how those three variables relate to quality: high quality wines show higher alcohol levels, and lower density and chlorides amounts, than the average and low quality wines. Interestingly, lower quality wines also have on average lower levels of chlorides than average quality wines, which seems to indicate that after a certain threshold it is not such a big factor of quality anymore.
The biggest correlations with density are residual sugar (0.839), total sulfur dioxide (0.53), and alcohol (-0.78).
First, the relationship between residual sugar and total sulfur dioxide, with alcohol as a colour (limiting the x-axis - residual sugar - to 30 g / dm^3 to exclude the outlier, the sweet wine):
Alcohol levels seem to become lower and lower as sugar amounts increase.
We saw previously that the residual sugar distribution was bimodal; separating the variable by the median (so that we have the two “peaks” isolated), we see here that the density and total sulfur dioxide levels are also shifted to the right and up in the chart on the right: wines with higher residual sugar levels also have higher density (which is not surprising since the two are strongly correlated) and total sulfur dioxide levels.
Density, sugar and total sulfur dioxide are also the most impactful variables for alcohol, which is the most impactful variable for quality (we’ll use rating here).
We see even more clearly the relationship between density and residual sugar by looking at the colors of the chart: the darker color (higher residual sugar) grouped on the right of the graph with the x-axis for density. The distribution of total sulfur dioxide also seems more grouped towards the bottom of the chart compared to the low quality sample.
Density and residual sugar seem to go hand in hand with one another, as do alcohol levels and quality. The impacts of chlorides and total sulfur dioxide are less easy to read, but also visible from the charts.
Density is definitely a variable that is impacted by a few other factors, which is also why I think it comes up as so strongly correlated to quality. Chlorides and total sulfur dioxide are also showing surprising relationships.
This boxplot shows the different distribution for the wines in scope related to their quality rating (0 = lowest rating possible, 10 = highest rating possible). We can see clearly that high quality wines have higher levels of alcohol than the low quality wines; alcohol is the biggest driver of quality for this dataset.
The second biggest driver of quality in our dataset is density: the two variables are negatively correlated, with a coefficient of -0.307, as shown in the boxplots above. As explained previously, to me low density is not a cause of good quality but rather a consequence of something else - and it has a correlation of 0.839 with residual sugar amounts. This is illustrated in the second plot, with a clear regression line showing that density increases with residual sugar amounts.
In addition to alcohol and density, chlorides is the third biggest driver of quality for white wines. We can see above how those three variables interact with each other in regard to quality (or rating).
This project aimed to help me put the pieces of exploratory data analysis together, and it certainly was a great introduction to it. Starting from a dataset in a csv file, we had to clean, analyze and visualize the data in order to understand the relationship between different features of white wines, and most importantly quality. The fact that the dataset was already cleaned and formatted made it easier to get started and focus on the analysis / visualization part, while at the same time giving a complete view of those types of projects. From a technical standpoint, it also made me more comfortable with using R - whether it is looking up the libraries’ documentation, improving charts, or creating new variables from my dataset.
For one thing, while having a dozen variables does not seem like much, it can become time-consuming to visualize each and every one of them before narrowing down to the few most important ones. The correlation matrix of the second section was definitely a great help for this, while the first section for univariate plots was more repetitive - plotting the distribution, looking at its shape or its outliers, analyzing its quartile values, and so on.
The second main struggle I encountered was choosing the types of plots that would best illustrate my analysis. In the first submission of this document for instance, I was doing boxplots with two numerical variables, and most of my visualizations were histograms and boxplots. I tried in this submission to give it more variety, with scatter plots or multivariate plots using regression lines for instance, even though there are still possible improvements to be made.
One success for me was to be able to make sense of both the distributions and the quartiles or correlation metrics, and build a story around this dataset that makes sense. Another one was to use R code chunks to improve my dataset, whether this was by creating categorical variables out of numerical ones, or creating vectors to label axes differently for instance.
As mentioned above, adding more diversity to the type of visualizations could be a big improvement to this analysis. Another one could be to analyze more features in the multivariate section, instead of just focusing on the three or four most important ones. A final improvement could be to try to build a linear model to predict quality based on multiple features, something I did not tackle in this project.